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Abstract 



Many least squares problems involve affine equality and inequality constraints. Although there 
are variety of methods for solving such problems, most statisticians find constrained estimation 
challenging. The current paper proposes a new path following algorithm for quadratic programming 
based on exact penalization. Similar penalties arise in l\ regularization in model selection. Classical 
penalty methods solve a sequence of unconstrained problems that put greater and greater stress 
on meeting the constraints. In the limit as the penalty constant tends to oo, one recovers the 
constrained solution. In the exact penalty method, squared penalties are replaced by absolute 
value penalties, and the solution is recovered for a finite value of the penalty constant. The exact 
path following method starts at the unconstrained solution and follows the solution path as the 
penalty constant increases. In the process, the solution path hits, slides along, and exits from the 
various constraints. Path following in lasso penalized regression, in contrast, starts with a large 
value of the penalty constant and works its way downward. In both settings, inspection of the 
entire solution path is revealing. Just as with the lasso and generalized lasso, it is possible to plot 
the effective degrees of freedom along the solution path. For a strictly convex quadratic program, 
the exact penalty algorithm can be framed entirely in terms of the sweep operator of regression 
analysis. A few well chosen examples illustrate the mechanics and potential of path following. 
Keywords: exact penalty, l\ regularization, shape restricted regression 



1 Introduction 



When constraints appear in estimation by maximum likelihood or least squares estimation, statisti- 
cians typically resort to sophisticated commercial software or craft specific optimization algorithms 
for specific problems. In this article, we develop a simple path algorithm for a general class of 
constrained estimation problems, namely quadratic programs with affine equality and inequality 
constraints. Besides providing constrained estimates, our new algorithm also delivers the whole so- 
lution path between the unconstrained and the constrained estimates. This is particularly helpful 
when the goal is to locate a solution between these two extremes based on criteria such as prediction 
error in cross-validation. 

In recent years several path algorithms have been devised for specific l\ regularization problems. 
The solution paths generated vividly illustrate the tradeoffs between goodness of fit and sparsity. For 
example, a modification of th e least angle regression ( LARS) procedure can handle lasso penalized 



regression (jEfron et al.. 20041) 



Rosset and Zhu (20071) give sufficient conditions for a solution path 



to be piecewise linear and expand its applications to a wider range of loss and penalty functions. 



Friedman (20081 ) derives a path algorithm for any objective function defined by the sum of a convex 



loss and a separable penalty (not necessarily convex) . The separability restriction on the penalty 



Tibshirani and Taylor (20111) devise a path 



term excludes many of the problems studied here, 
algorithm for generalized lasso problems. Their formulation is similar to ours, but there are two 
fundamental differences. First, inequality constraints are excluded in their formulation. Our new 
path algorithm handles both equality and inequality constraints gracefully. Second, they pass to 
the dual problem and then translate the solution path of the dual problem back to the solution 
path of the primal problem. In our view, attacking the primal problem directly leads to a simpler 
algorithm, indeed one driven entirely by the classical sweep operator of regression analysis. These 
gains in conceptual clarity and implementation ease constitute major pluses for statisticians. As we 



will show, the degrees of freedom formula derived fo r the lasso 



Efron et al.. 2 004; 



Zou et al.. 20071) 



and generalized lasso (jTibshirani and Taylor. 20111 ) apply equally well in the presence of inequality 
constraints. 
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Our object of study will be minimization of the quadratic function 

f{x) = -xtAx + tfx + c (1) 

subject to the affine equality constraints Vx = d and the afhne inequality constraints Wx < e. 
Throughout our discussion we assume that the feasible region is nontrivial and that the minimum is 
attained. If the symmetric matrix A has a negative eigenvalue A and corresponding unit eigenvector 
u, then lmv^oo f(ru) — — oo because the quadratic term i(ru)* A(ru) = -|r 2 dominates the linear 
term rb l u. To avoid such behavior, we initially assume that all eigenvalues of A are positive. This 
makes f(x) strictly convex and coercive and guarantees a unique minimum point subject to the 
constraints. In linear regression A = X 1 X for some design matrix X . In this setting A is positive 
definite provided X has full column rank. The latter condition is only possible when the number 
of cases equals or exceeds the number of predictors. If A is positive s emidefinite and singular, the n 
adding a small amount of ridge regularization el to it can be helpful (jTibshirani and Taylor. 20111 ). 
Later we indicate how path following extends to positive semidefinite or even indefinite matrices 
A. 

In multi-task models in machine learning, the response is a d-dimensional vector Y £ R d , and 
one minimizes the squared Frobenius deviation 

\\\Y-XB\\l (2) 

with respect to the p x d regression coefficient matrix B. When the constraints take the form 
VB < D and WB = E, the problem reduces to quadratic programming as just posed. Indeed, 
if we stack the columns of Y with the vec operator, then the problem reduces to minimizing 
|||vec(y) — I <E> Xvec(B)|||. Here the identity vec(XB) = J® Xvec(B) comes into play involving 
the Kronecker product and the identity matrix i\ The same identity allows to rewrite the constraints 
as I ® Vvec(X) = vec(-D) and J ® Wvec(X) < vec(-E). 



As an illustration, consider the classical concave regression problem (jHildreth. 19541 ) . The data 
consist of a scatter plot (xj, j/j) of n points with associated weights Wi and predictors arranged in 
increasing order. The concave regression problem seeks the estimates 0$ that minimize the weighted 
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sum of squares 



subject to the concavity constraints 



The consistency of concave regression is proved by 



Y^^-^f ( 3 ) 



> 1+1 - , i = 2,...,n-l. (4) 



Hanson and Pledger ( 19761 ): the asy mptotic dis 



tribution of the estimates and their rate of convergence are studied in subsequent papers ([Mammen. 1991 



Groeneboom et al.. 20011 ). Figure [T] shows a scatter plot of 100 data points. Here the Xi are uni- 
formly sampled from the interval [0,1], the weights are constant, and y% — 4a;j(l — xi) + ej, where 
the Ci are i.i.d. normal with mean and standard deviation a = 0.3. The left panel of Figure [T] gives 
four snapshots of the solution path. The original data points Qi = yi provide the unconstrained 
estimates. The solid line shows the concavity constrained solution. The dotted and dashed lines 
represent intermediate solutions between the unconstrained and constrained solutions. The degrees 
of freedom formula derived in Section [6] is a vehicle for model selection based on criterion such as 
C p , AIC, and BIC. For example, the C p statistic 

c p (e) = i\\y-eg + l a M{(e) 



is an unbiased estimator of the true prediction error ([Efron. 20041 ) under the estimator 6. The 
right panel shows the C p statistic along the solution path. In this example the design matrix is a 
diagonal matrix. As we will see in Section postulating a more general design matrix or other 
kinds of constraints broadens the scope of applications of the path algorithm and the estimated 
degrees of freedom. 

Here is a roadmap to the remainder the current paper. Section [5] reviews the exact penalty 
method for optimization and clarifies the connections between constrained optimization and reg- 
ularization in statistics. Section [3] derives in detail our path algorithm. Its implementation via 
the sweep operator and QR decomposition are described in Sections U and [5] Section [5] derives 
the degrees of freedom formula. Section [7] presents various numerical examples. Finally, Section [S] 
discusses the limitations of the path algorithm and hints at future generalizations. 
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Figure 1: Path solutions to the concave regression problem. Left: the unconstrained solution 
(original data points), two intermediate solutions (dotted and dashed lines), and the concavity 
constrained solution (solid line). Right: the C p statistic as a function of the penalty constant p 
along the solution path. 

2 The Exact Penalty Method 

Exact penalty methods minimize the function 

r s 

t=l j=l 

where f(x) is the objective function, Qi{x) = is one of r equality constraints, and hj{x) < is one 
of s inequality constraints. It is interesting to compare this function to the Lagrangian function 

r s 

£{x) = f{x) + h9i{x) +^2pjhj(x) 

i=l j=l 

that captures the behavior of f(x) at a constrained local minimum y. By definition the Lagrange 
multipliers satisfy the conditions V£(y) = and pj > and pjhj(y) = for all j. In the exact 
penalty method we take 



p > max{|Ai|,...,|A r |,/ii,...,/i s }. 



(5) 



This choice creates the majorization f(x) < £ p (x) with f(z) = £ P (z) at any feasible point z. Thus, 
minimizing £ p {x) forces f{x) downhill. Much more than this is going on however. As the next 
proposition proves, minimizing £ p (x) effectively minimizes f(x) subject to the constraints. 
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Proposition 2.1. Suppose the objective function f(x) and the constraint functions are twice dif- 
ferentiate and satisfy the Lagrange multiplier rule at the local minimum y. If inequality J3J) holds 
and v*d 2 C(y)v > for every vector v 7^ satisfying dgi(y)v = and dhj(y)v < for all active 
inequality constraints, then y furnishes an unconstrained local minimum of £ p (x). If f(x) is con- 
vex, the g%(x) are affine, the hj(x) are convex, and Slater's constraint qualification holds, then y 
is a minimum of £ p (x) if and only if y is a minimum of f(x) subject to the constraints. In this 
convex programming context, no differentiability assumptions are needed. 

Proof: The conditions imposed on the quadratic form v* d 2 C(y)v > are well-known su fficient 



conditions for a local minimum. Theorems 6.9 and 7.21 of the reference (jRuszczynski. 20061) prove 



all of the foregoing assertions. □ 

3 The Path Following Algorithm 

We now resume our study of minimizing the objective function ([l} subject to the affine equality 
constraints Vx = d and the affine inequality constraints Wx < e. The corresponding penalized 
objective function takes the form 

1 r s 

£ p (x) = -x t Ax + b t x + c + p^2\v t i x-d l \+p^2{w t j x-e i ) + . (6) 

i=i j=i 

Our assumptions on A render £ p (x) strictly convex an d coercive and guarantee a un ique minimum 



point x(p). The generalized lasso problem studied in (|Tibshirani and Taylor. 20111 ) drops the last 



term and consequently excludes inequality constrai ned applications 



According to the rules of the convex calculus (|Ruszczvhski. 20 06). the unique optimal point 



x{p) of the function £ p (x) is characterized by the stationarity condition 



= Ax{p) + b + p^2 sWi + p^tjWj (7) 



with coefficients 



'{-1} v\x-di<0 ({0} w t j x-e l <0 

Si e{ [-1,1] v\x-d t = 0, ^6 [0,1] w t j x-e i = 0. (8) 

{1} v\x-d t >Q [{1} w)x-e l >Q 
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Assuming the vectors (Ui{vi}) U (Uj{wj}) are linearly independent, the coefficients Sj and tj are 
uniquely determined. The sets defining the possible values of Sj and tj are the subdifferentials of 
the functions |sj| and (tj)+ = max{0,^}. 

The solution path is continuous when A is positive definite. This also implies that the 
coefficient paths s(p) and t(p) are continuous. For a rigorous proof, note that the representation 

r s 

x(p) = -A~ 1 (b + p^SjVi + py^JjWj) 
i=i j=i 

entails the norm inequality 

||x(p)|| < IIA^II^Ibll+p^lKII+p^lKIl). 

»=i j=i 

Thus, the solution vector a;(p) is bounded whenever p > is bounded above. To prove continuity, 
suppose that it fails for a given p. Then there exists an e > and a sequence p n tending to p such 
||x(p„) — cc(p)|| > e for all n. Since x(p n ) is bounded, we can pass to a subsequence if necessary and 
assume that x(p n ) converges to some point y. Taking limits in the inequality £ Pn [x(p n )] < £ Pn {x) 
demonstrates that £ p (y) < £ P (x) for all x. Because x(p) is unique, we reach the contradictory 
conclusions \\y — x{p)\\ > e and y = x(p). Continuity is inherited by the coefficients Sj and tj. 
Indeed, let V and W be the matrices with rows v\ and Wj, and let U be the block matrix 
The stationarity condition can be restated as 

= Ax + b + pU l 

Multiplying this equation by U and solving give 



(9) 

and the continuity of the left-hand side follows from the continuity of x(p). Finally, dividing by p 
yields the continuity of the coefficient Sj and tj for p > 0. 

We next show that the solution path is piecewise linear. Along the path we keep track of the 
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-{UU^U Ax(p)+b 



following index sets determined by the constraint residuals: 

W E = {i : v\x - d l < 0}, Mi = {j : w]x - e < 0} 

Z E = {i : v\x -di = 0}, Zi = {j : w]x - ej = 0} 

Ve = {i ■ v\x -d t > 0}, Vi = {j : w)x - ej > 0}. 

For the sake of simplicity, assume that at the beginning of the current segment Sj does not equal — 1 
or 1 when i G Ze and tj does not equal or 1 when j £ Z\. In other words, the coefficients of the 
active constraints occur on the interior of their subdiffcrcntials. Let us show in this circumstance 
that the solution path can be extended in a linear fashion. The general idea is to impose the 
equality constraints Vz E x = dz E and W ZjX = ez l and write the objective function £ p {x) as 

\^x l Ax + b t x + c- p ^2 ( v i x - d i)+ ( v l x ~di)+ {w)x - e 3 ). 

ieNe iev E jeVi 

For notational convenience define 

Uz = ( ) , c z = ( d f E ) , u z = - v i + v * + w i- 

\w Zl J V e zi J ieMs iePE jeVl 

Minimizing £ p (x) subject to the constraints generates the Lagrange multiplier problem 



A U t z \f x \ ( -b-puz 

U z \ X z J \ c z 



with 



Q = A- X U Z {U zA^UzY 1 
R = -(UzA^UzY 1 - 



(10) 



with the explicit path solution and Lagrange multipliers 

x{p) = -P(b + puz) + Qc z = pPu z Pb + Qcz (11) 



Here 



= Q l b + Rc z 


-pQ*Uz- 




(12) 




= ( A 














P = A' 1 - A 


^(UzA- 


^UzY'UzA- 1 





As we will see in the next section, these seemingly complicated objects arise naturally if path 
following is organized around the sweep operator. 

It is clear that as we increase p, the solution path (fTTj) changes in a linear fashion until either 
an inactive constraint becomes active or the coefficient of an active constraint hits the boundary 
of its subdiffcrcntial. We investigate the first case first. Imagining p to be a time parameter, an 
inactive constraint i £ A/"e U Ve becomes active when 

v\x(p) = -v\P(b + pUz)+v\Qc z = di. 

If this event occurs, it occurs at the hitting time 

n« = 



v\Pb + v\Qc z - di 



v\Puz 



(13) 



Similarly, an inactive constraint j G A/j U V\ becomes active at the hitting time 

_ -w)Pb + W )Qc z -e 2 
P w)Pu z ' [ ' 

To determine the escape time for an active constraint, consider once again the stationarity 
condition ([7|). The Lagrange multiplier corresponding to an active constraint coincides with a 
product p-Si(p) or ptj(p). Therefore, if we collect the coefficients for the active constraints into the 
vector r z (p), then equation $12\i implies 

r z { P ) = -\z(p) = -( Q'b + Rcz) Q l u z . (15) 
P P 

Formula (|T5j) for r z (p) can be rewritten in terms of the value r z (p ) at the start po of the current 
segment as 

r z (p) = —r z (p ) - fl-—^\ Q'u z . (16) 
P \ P J 

It is clear that r z (p)i is increasing in p when {r z (po) + Q t u z ]i < and decreasing in p when the 
reverse is true. The coefficient of an active constraint i £ -Ee escapes at either of the times 

(i) [-Q'b + Rcz], [-Qtb + Rczk 
P ~ — } — t — ~-i or 



[Q'uz], - 1 [Q f u z ] t + 1 
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whichever is pertinent. Similarly, the coefficient of an active constraint j 6 Z\ escapes at either of 
the times 

{J) _ [-Q% + Rcz\j [ Q'b + Rc z ]j 

whichever is pertinent. The earliest hitting time or escape time over all constraints determines the 
duration of the current linear segment. 

At the end of the current segment, our assumption that all active coefficients occur on the interior 
of their subdifferentials is actually violated. When the hitting time for an inactive constraint occurs 
first, we move the constraint to the appropriate active set Z-& or Z\ and keep the other constraints in 
place. Similarly, when the escape time for an active constraint occurs first, we move the constraint 
to the appropriate inactive set and keep the other constraints in place. In the second scenario, if 
Si hits the value —1, then we move i to Ae- If Sj hits the value 1, then we move i to Ve- Similar 
comments apply when a coefficient tj hits or 1 . Once this move is executed, we commence a new 
linear segment as just described. The path following algorithm continues segment by segment until 
for sufficiently large p the sets Ae, V-e, and V\ are exhausted, if j = 0, and the solution vector (JTTJ) 
stabilizes. 

This description omits two details. First, to get the process started, we set p = and x(0) = 
— A _1 b. In other words, we start at the unconstrained minimum. For inactive constraints, the 
coefficients s 2 (0) and tj(0) are fixed. However for active constraints, it is unclear how to assign 
the coefficients and whether to release the constraints from active status as p increases. Second, 
very rarely some of the hitting times and escape times will coincide. We are then faced again with 
the problem of which of the active constraints with coefficients on their subdifferential boundaries 
to keep active and which to encourage to go inactive in the next segment. In practice, the first 
problem can easily occur. Roundoff error typically keeps the second problem at bay. 

In both anomalous cases, the status of each of active constraint can be resolved by trying all 
possibilities. Consider the second case first. If there are a currently active constraints parked at 
their subdifferential boundaries, then there are 2 a possible configurations for their active-inactive 
states in the next segment. For a given configuration, we can exploit formula (1151) to check whether 
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the coefficient for an active constraint occurs in its subdifferential. If the coefficient occurs on the 
boundary of its subdifferential, then we can use representation (|16p to check whether it is headed 
into the interior of the subdifferential as p increases. Since the path and its coefficients are unique, 
one and only one configuration should determine the next linear segment. At the start of the path 
algorithm, the correct configuration also determines the initial values of the active coefficients. If 
we take limits in equation (|15[) as p tends to 0, then the coefficients will escape their subdifferentials 
unless — Q l b + Rcz = and all components of — Q'u^? lie in their appropriate subdifferentials. 
Hence, again it is easy to decide on the active set Z going forward from p = 0. One could object 
that the number of configurations 2 a is potentially very large, but in practice this combinatorial 
bottleneck never occurs. Visiting the various configurations can be viewed as a systematic walk 



through the subsets of {l,...,a} and organized using a classical gray code (jSavage. 19971) that 
deletes at most one element and adjoins at most one element as one passes from one active subset 
to the next. As we will see in the next section, adjoining an element corresponds to sweeping a 
diagonal entry of a tableau and deleting an element corresponds to inverse sweeping a diagonal 
entry of the same tableau. 



4 The Path Algorithm and Sweeping 

Implementation of the path algorithm can be conveniently or g anized around th e 



verse sweep operators of regression analysis ([Dempster. 1969 ; 



Little and Rubin. 2002 ; 



Goodnight. 1979 ; 



sweep and in- 



Jennrich. 1977 



Lange. 20101 ) . We first recall the definition and basic properties of the 
sweep operator. Suppose A is an m x m symmetric matrix. Sweeping on the fcth diagonal entry 
a-kk 7^ of A yields a new symmetric matrix A with entries 

1 

akk — , 

flfefe 

a-ik . , , 

a lk = , i 7* k 

dkk 

akj . , , 

akj = , 3 T k 

akk 

_ a^akj . . / j 

a%j — CLij , 2, J K. 
akk 
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These arithmetic operations can be undone by inverse sweeping on the same diagonal entry. Inverse 
sweeping sends the symmetric matrix A into the symmetric matrix A with entries 

1 

a-kk ' 



afefe 
dik 
&kj 



a 



= - — , i^k 

dkk 

= j^k 

akk 

a ik a kj . . / i 

a-kk 



Both sweeping and inverse sweeping preserve symmetry. Thus, all operations can be carried out on 
either the lower or upper triangle of A alone, saving both computational time and storage. When 
several sweeps or inverse sweeps are performed, their order is irrelevant. Finally, a symmetric 
matrix A is positive definite if and only if A can be completely swept, and all of its diagonal entries 
remain positive until swept. Complete sweeping produces — A -1 . Each sweep of a positive definite 
matrix reduces the magnitude of the unswept diagonal entries. Positive definite matrices with poor 
condition numbers can be detected by monitoring the relative magnitude of each diagonal entry 
just prior to sweeping. 

At the start of path following, we initialize a path tableau with block entries 



-A 


-XJ* 


b 


* 







* 


* 






(17) 

The starred blocks here are determined by symmetry. Sweeping the diagonal entries of the upper- left 
block A of the tableau yields 



A- 1 


A" 1 17* 


-A~ 1 b 


* 


UA~ i U t 


-UA- L b-c 


* 


* 


b'A^b 



The new tableau contains the unconstrained solution x(0) = —A x 6 and the corresponding con- 
straint residuals —UA~ 1 b — c. In path following, we adopt our previous notation and divide the 
original tableau into sub-blocks. The result 

\ 

(18) 

V 



-A 


-u% 


-17*- 
u z 


b 


* 








-c z 


* 







~Cz 


* 


* 


* 
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highlights the active and inactive constraints. If we continue sweeping until all diagonal entries of 
the upper-left quadrant of this version of the tableau are swept, then the tableau becomes 



p 


Q 


PU% 




Pb + Qcz 


* 


R 


Q l U% 




Q l b + Rc z 




* 


UzPU l £ 


Uz{- 


Pb + Qcz) - eg 


* 


* 


* 


tfPb 


- 2b t Qc z + c l z Rc z 



All of the required elements for the path algorithm now magically appear. 

Given the next p, the solution vector x(p) appearing in equation requires the sum — Pb + 
Qcz , which occurs in the revised tableau, and the vector Pu j . If r g denotes the coefficient vector 
for the inactive constraints, with entries of —1 for constraints in Ae, for constraints in A/"i, and 
1 for constraints in Ve U "Pi, then Pu z = PUgrg- Fortunately, PUg appears in the revised 
tableau. The update of p depends on the hitting times (fT5|) and (fT^j) . These in turn depend on 
the numerators —v\Pb + v\Qcz ~ di and —w^Pb + w^Qcz — ej, which occur as components of 
the vector U z(—Pb + Qcz) — eg, and the denominators v\Puz and w^Pu^, which occur as 
components of the matrix U gPU^r z computable from the block U gPl) j of the tableau. The 
escape times for the active constraints also determine the update of p. According to equation (|16p , 
the escape times depend on the current coefficient vector, the current value po of p, and the vector 
Q l Uz = Q*C/gT ' Z-, which can be computed from the block Q t U t z of the tableau. Thus, the revised 
tableau supplies all of the ingredients for path following. Algorithm [1] outlines the steps for path 
following ignoring the anomalous situations. 

The ingredients for handling the anomalous situations can also be read from the path tableau. 
The initial coefficients rz(0) — —Q tig = Q Usr£ are available once we sweep the tableau Q17[l 
on the diagonal entries corresponding to the constraints in Z at the point cc(0) = — A~ l b. As 
noted earlier, if the coefficients of several active constraints are simultaneously poised to exit their 
subdiffcrcntials, then one must consider all possible swept and unswept combinations of these con- 
straints. The operative criteria for choosing the right combination involve the available quantities 
Q l Uz and Q l b + Rcz- One of the sweeping combinations is bound to give a correct direction for 
the next extension of the path. 

The computational complexity of path following depends on the number of parameters m and 
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the number of constraints n = r + s. Computation of the initial solution —A~ 1 b takes about 3m 3 
floating point operations (flops). There is no need to store or update the P block during path 
following. The remaining sweeps and inverse sweeps take on the order of n(m + n) flops each. 
This count must be multiplied by the number of segments along the path, which empirically is on 
the order of 0{n). The sweep tableau requires storing (m + n) 2 real numbers. We recommend all 
computations be done in double precision. Both flop counts and storage can be halved by exploiting 
symmetry. Finally, it is worth mentioning some computational shortcuts for the multi-task learning 
model. Among these are the formulas 

(J(g)X)*(J(g)X) = I®X l X 
(I(g> X^y 1 = Kg^X)- 1 

{i®x t xy 1 i®v = Kgiix'x^v 

(ItgiX'X^KgiW = I^iX'X^W. 

Algorithm 1 Solution path of the primal problem ([6]) when A is positive definite. 

Initialize k = 0, po = 0, and the path tableau (fTTl) . Sweep the diagonal entries of A. Enter the 

main loop. 

repeat 

Increment k by 1. 

Compute the hitting time or exit time p^ for each constraint i. 

Set p k = min{pW : p w > pk-i}- 

Update the coefficient vector by equation (fT6|) . 

Sweep the diagonal entry of the inactive constraint that becomes active or inverse sweep the 
diagonal entry of the active constraint that becomes inactive. 
Update the solution vector Xf. = x(pk) by equation (fTT|) . 
until Af E = P E = Pi = 



5 Extensions of the Path Algorithm 

As just presented, the path algorithm starts from the unconstrained solution and moves forward 
along the path to the constrained solution. With minor modifications, the same algorithm can start 
in the middle of the path or move in the reverse direction along it. The latter tactic might prove 
useful in lasso and fused-lasso problems, where the fully constrained solution is trivial. In general, 
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consider starting from x(po) at a point po on the path. Let Z = Ze U Z\ continue to denote the 
zero set for the segment containing po- Path following begins by sweeping the upper left block of 
the tableau (TT51) and then proceeds as indicated in Algorithm [TJ Traveling in the reverse direction 
entails calculation of hitting and exit times for decreasing p rather than increasing p. 

Our assumption that A is positive definite automatically excludes underdetermined statistical 
problems with more parameters than cases. Here we briefly indicate how to carry out the exact 
penalty method when this assumption fails and the sweep operator cannot be brought into play. In 
the absence of constraints, f(x) lacks a minimum if and only if either A has a negative eigenvalue 
or the equation Ax = b has no solution. In either circumstance a unique global minimum may 
exist if enough constraints are enforced. Suppose x(po) supplies the minimum of the exact penalty 
function £ p (x) at p = po > 0. Let the matrix TJ z hold the active constraint vectors. As we slide 
along the active constraints, the minimum point can be represented as x(p) — x(po) + Yy(p), 
where the columns of Y are orthogonal to the rows of TJ z ■ One can construct Y by the Gramm- 
Schmidt process; Y is then the orthogonal complement of TJ z in the QR decomposition. The active 
constraints hold because Uzx(p) = Uzx(po) = cz- 

The analogue of the stationarity condition ([7]) under reparameterization is 

= Y t AYy(p) + Y t b + pY t Uz. (19) 

The inactive constraints do not appear in this equation because v\Y = and WjY = for i or j 
active. Solving for y(p) and x{p) gives 

y(j>) = -(Y'AYy^Y'b + pY'uz) 

x(p) = x(p )-Y(Y t AY)- 1 (Y t b + P Y t u 2 ) (20) 

and does not require inverting A. Because the solution x(p) is affine in p, it is straightforward to 
calculate the hitting times for the inactive constraints. 

Under the original parametrization, the Lagrange multipliers and corresponding active coeffi- 
cients appearing in the stationarity condition ([7]) can still be recovered by invoking equation 
Again it is a simple matter to calculate exit times. The formulas are not quite as elegant as 
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those based on the sweep operator, but all essential elements for traversing the path are available 
Adding or deleting a row of the matrix U z can be accomplished by updating the QR decompo 



sition. The fast algorithms for this purpose simultaneously update Y ([Lawson and Hanson. 1987 



Nocedal and Wright. 20061) . More generally for equality constrained problems generated by the 
lasso and generalized lasso, the constraint matrix U z as one approaches the penalized solution is 
often very sparse. The required QR updates are then numerically cheap. For the sake of brevity, 
we omit further details. 

6 Degrees of Freedom Under Affine Constraints 

We now specialize to the least squares problem with the choices A = X t X, b = —X t y, and x(p) = 
0{p) and consider how to define degrees of freedom in the presence of both equal i ty and inequality 



constraints. As previous authors (jEfron et al.. 2004 ; 



Tibshirani and Taylor, 2011 



have shown, the most productive approach relies on Stein's characterization 



Zou et al. 20071) 



Efron. 2004 ; 



Stein. 19811) 



df(y) 



E £ 



dyi 



= E[tr(rf y y)] 



of the degrees of freedom. Here y — Xf3 is the fitted value of y, and d y y denotes its differential 
with respect to the entries of y. Equation (ITT1) implies that 

y = Xj3 = XPXty + XQcz- pXPu 2 . 

Because p is fixed, it follows that d y y = XPX 1 . The representation 

XPX t 

= x{x t x)- 1 x t - x{x l ■x)- 1 u t z [u z {x* xy^-u^u z {x l xy 1 x l 
= Pi -P2 

and the cyclic permutation property of the trace function applied to the matrices P\ and P2 yield 
the formula 



E[tr(d y y)] = m-E(|Z|), 
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where m equals the number of parameters. In other words, m — \Z\ is an unbiased estimator of the 
degrees of freedom. This result obviously depends on our assumptions that X has full column rank 
m and the constraints Vi and Wj are linearly independent. The latter condition is obviously true 
for lasso and fused-lasso problems. The validity of Stein's for mula requir es the fitted value y to be 
a continuous and almost surely differentiable function of y (IStein. 1981). Fortunately, this i s the 



case for lasso (|Zou et al.. 20071 ) and generalized las so problems 



Tibshirani an d Taylor. 20111) and 



for at least one case of shape-restricted regression ([Meyer and Woodroofe. 20001 ) . Our derivation 
does not depend directly on whether the constraints are equality or inequality constraints. Hence, 
the degrees of freedom estimator can be applied in shape-restricted regression using model selection 
criteria such as C p , AIC, and BIC along the whole path. The concave regression example in the 
introduction illustrates the general idea. 

7 Examples 

Our examples illustrate both the mechanics and the potential of path following. The path algo- 
rithm's ability to handle inequality constraints allows us to obtain path solutions to a variety of 
shape-restricted regressions. Problems of this sort may well dominate the future agenda of non- 
parametric estimation. 

7.1 Two Toy Examples 



Our first example (jLawson and Hanson. 19871) fits a straight line y = /3q + x/3\ to the data points 



(0.25,0.5), (0.5,0.6), (0.5,0.7), and (0.8,1.2) by minimizing the least squares criterion \\y — X/3\\ 



subject to the constraints 



/3i > 0, /3 >0, A) + )8i<l. 



In our notation 
A = 

W = 



x l x = 




4.0000 2.0500 
2.0500 1.2025 




b = X l y = 



-3.0000 
-1.7350 
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The initial tableau is 



-4.0000 


-2.0500 


1 


1 


-1 


-3.0000 \ 


-2.0500 


-1.2025 








-1 


-1.7350 


1 

















1 

















-1 


-1 











-1 


-3.0000 


-1.7350 








-1 


o / 



Sweeping the first two diagonal entries produces 



/ 1.9794 -3.3745 
-3.3745 6.5844 



-1.9794 3.3745 
3.3745 -6.5844 
-1.3951 3.2099 



V 0.0835 1.3004 



-1.9794 3.3745 
3.3745 -6.5844 



-1.3951 
3.2099 



1.9794 -3.3745 1.3951 
-3.3745 6.5844 -3.2099 
1.3951 -3.2099 1.8148 



0.0835 \ 
1.3004 



-0.0835 
-1.3004 
0.3840 



2.5068 / 



-0.0835 -1.3004 0.3840 
from which we read off the unconstrained solution (3(0) = (0.0835, 1.3004)' and the constraint 
residuals (-0.0835,-1.3004, 0.3840)'. The latter indicates that M = {1, 2}, Zj = 0, and V\ = {3}. 
Multiplying the middle block matrix by the coefficient vector r — (0, 0, 1)' and dividing the residual 
vector entrywise give the hitting times p = (-0.0599,0.4051,0.2116). Thus p x = 0.2116 and 



/3(0.2116) 



0.0835 
1.3004 



0.2116 x 



-1.3951 
3.2099 



0.3787 
0.6213 



Now M = {1,2}, Z = {3}, V — 0, and we have found the solution. Figure [5] displays the data 
points and the unconstrained and constrained fitted lines. 



Our second toy example concerns the toxin response problem (jSchoenfeld. 19861 ) with m toxin 
levels x\ < X2 < ■ • • < x m and a mortality rate yi = f(xi) at each level. It is reasonable to 
assume that the mortality function f(x) is nonnegative and increasing. Suppose y+ are the observed 



death frequencies averaged across rij trials at level xi. In a finite sample, the 



Hi may fail to b e 



nondecreasing. For example, in an EPA study of the effects of chromium on fish (jSchoenfeld. 1986 ) 
the observed binomial frequencies and chromium levels are 

y = (0.3752,0.3202,0.2775,0.3043,0.5327)' 
x = (51,105,194,384,822)' in pg/l. 



Isotonic regression minimizes 53fc=i(yfe 



subject to the constraints < 9\ < ■ ■ ■ < 6 m on 



the binomial parameters 6k = f(xk)- The solution path depicted in Figure [3] is continuous and 
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— constrained 



0.2 0.4 0.6 0.8 1 

x 



1.2 



Figure 2: The data points a nd the fitted lines for the first toy example of constrained curve fitting 
( Lawson and Hanson. 1 987). 



piecewise linear as advertised, but the coefficient paths are nonlinear. The first four binomial 
parameters coalesce in the constrained estimate. 

7.2 Generalized Lasso Problems 



The generalized lasso problems studied in (|Tibshirani and Tavlor. 201lh all reduce to minimization 
of some form of the objective function (|6]). To avoid repetition, we omit detailed discussion of 
this class of problems and simply refer readers interested in applications to lasso or fused-lasso 
penalized regression, outlier d etections, trend filtering, and image restoration to the original article 



( Tibshirani and Taylor. 20111) . Here we would like t o point out the re levance of the generalized 
lasso problems to graph-guided penalized regression (jChen et al.. 20101 ). Suppose each node i of 
a graph is assigned a regression coefficient and a weight In graph penalized regression, the 



objective function takes the form 

h\W(y-X(3)\\l + ^Yl 



A ._/_ x ft 



sgn(rij)- 



(21) 



where the set of neighboring pairs i ~ j define the graph, di is the degree of node i, and is the 
correlation coefficient between i and j. Under a line graph, the objective function (|21j) reduces to 
the fused lasso. In 2-dimensional imaging applications, the graph consists of neighboring pixels in 
the plane, and minimization of the function (|21[) is accomplished by total variation algorithms. In 
MRI images, the graph is defined by neighboring pixels in 3 dimensions. Penalties are introduced 
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in image reconstruction and restoration to enforce smoothness. In microarray analysis, the graph 
reflects gene networks. Smoothing the /3i over the network is motivated by the assumption that the 
expression levels of related ge nes should rise an d fall in a coordinated fashion. Ridge regularization 



in graph penalized regression ([Li and Li. 20081 ) is achieved by changing the objective function to 

2 



\W(y-Xf3)\\l 



ft 



sgn(r;y) 



ft 



If one fixes either of the tuning constants in these models, our path algorithm delivers the solution 
path as a function of the other tuning constant, Alternatively, one can fix the ratio of the two 
tuning constants. Finally, the extension 



K 



sgn(r i: 



l -\\Y-XB\\l + \ G Y^Y. 

»~i fc=i 

of the objective function to multivariate response models is obvious 



ftsi 

\fd~i 



In principle, the path algorithm applies to all of these problems provided the design matrix 
X has full column rank. If X has reduced rank, then it is advisable to add a sm all amount 
of ridge regularization e^ift 2 to the objective function (jTibshirani and Tavlor. 20111 ). Even so, 



computation of the unpenalized solution may be problematic in high dimensions. Alternatively, path 
following can be conducted starting from the fully constrained problem as suggested in Section [5] 

7.3 Shape Restricted Regressions 



Order-constrained regres sion is now widely accepted as an important modeling tool ([Robertson et al.. 1988 



Silvapulle and Sen. 20051 ) . If (3 is the parameter vector, monotone regression includes isotone con- 
straints /3i < /?2 < • • • < ftn or antitone constraints 0% > 02 > • • • > (3 m . In partially ordered 
regression, subsets of the parameters are subject to isotone or antitone constraints. In other prob- 
lems it is sensible to impose convex or concave constraints. If observations are collected at irregularly 
spaced time points t\ < < • • • < t m , then convexity translates into the constraints 



ft+2 - 13, 



i+l 



> 



A+i - ft 



t'i+2 — U+l U+i — ti 

for 1 < i < m — 2. When the time intervals are uniform, these convex constraints become ft +2 — 
ft+i > ft+i ~ ft- Concavity translates into the opposite set of inequalities. All of these shape 
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restricted regression problems can be solved by path following. 

As an example of p artial isotone regression, we fit the data from Table 1.3.1 of the reference 



( Robertson et al.. 19881) on the first-year grade point averages (GPA) of 2397 University of Iowa 
freshmen. These data can be downloaded as part of the R package ic . infer. The ordinal predictors 
high school rank (as a percentile) and ACT (a standard aptitude test) score are discretized into 
nine ordered categories each. A rational admission policy based on these two predictor sets should 
be isotone separately within each set. Figure U shows the unconstrained and constrained solutions 
for the intercept and the two predictor sets and the solution path of the regression coefficients for 
the high school rank p redictor. 



The same authors ([Robertson et al.. 19881 ) predict the probability of obtaining a B or better 
college GPA based on high school GPA and ACT score. In their data covering 1490 college students, 
yij is the proportion of students who obtain a B or better college GPA among the students who 
are within the ith ACT category and the jth high school GPA category. Prediction is achieved 
by minimizing the criterion 53j 53 j 'H? (fe — subject to the matrix partial-order constraints 
6 n > 0, 9ij < 6%+i,j, and 9ij < Oij+i- Figure [5] shows the solution path and the residual sum of 
squares and effective degrees of freedom along the path. The latter vividly illustrates the tradeoff 
between goodness of fi t and degrees of freedom. Readers can consult page 33 of the reference 



( Robertson et al.. 1 988) for the original data and the constrained parameter estimates. 



7.4 Nonparametric Shape-Restricted Regression 

In this section we visit a few problems amenable to the path algorithm arising in nonparametric 
statistics. Given data (xi, j/j), i = 1, . . . , n, and a weight function w(x), nonparametric least squares 
seeks a regression function 9{x) minimizing the criterion 

n 

£>(xi)[yi - e{xi)f (22) 

i=l 

over a space C of functions with shape restrictions. In concave regression for instance, C is the 
space of concave functions. This seemingly intractable infinite dimensional problem can be sim- 
plified by minimizing the least squares criterion subject to inequality constraints. For a uni- 
variate predictor and concave regression, the constraints Q arc pertinent. The piecewise lin- 
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ear function extrapolated from the estima ted 9j is clearly concave. T he consistency of concav- 



ity constrained least squares is proved by 



Hanson and Pledger (19761) ; the asymptotic distribu- 



tion of the corresponding estimator and i ts rate of convergence are investigated in later papers 



(Groeneboom et al.. 2001 



dictors include m onotonicity 



BrunkU955 



Mammc n. 1991). Other relevant shape rest rictions for univariate pre 



Grenander. 19561 ) , convexity ([Groeneboom et al.. 20011 ) 



supermodularity (jBeresteanu. 20041 ). and combinations of these. 

Multidimensional nonparametric estimation is much harder because there is no natural order 
on ffi d when d > 1. One fruitful approac h to shape- restricted regression relies on sieve estimators 



( Ber esteanu. 2 004; 



Shen and Wong. 19941 ) . The general idea is to introduce a basis of local functions 



(for example, normalized B-splines) centered on the points of a grid G spanning the support of the 
covariate vectors a^. Admissible estimators are then limited to linear combinations of the basis 
functions subject to restrictions on the estimates at the grid points. Estimation can be formalized 
as minimization of the criterion \\y — \&(X)#||2 subject to the constraints C*&(G)9 < 0, where 
\&(X) is the matrix of basis functions evaluated at the covariate vectors x^, *&(G) is the matrix 
of basis functions evaluated at the grid points, and 9 is a vector of regression coefficients. The 
linear inequality constraints incorporated in the matrix C reflect the required shape restrictions 
required. Estimation is performed on a sequence of grids (a sieve). Control l ing the rate at which th e 



Shen and Wong. 19941 ) 



sieve sequence converges yields a consistent estimator ([Beresteanu. 2004 ; 
Prediction reduces to interpolation, and the path algorithm provides a computational engine for 
sieve estimation. 

A related but different approach for multivariate convex regression minimizes the least squares 
criterion ([3]) subject to the constraints $,l(xj ~ xi) < 9j — 9i for every ordered pair In effect, 

9i is viewed as the value of the regression function 9(x) at the point Xi. The unknown vector £ 4 
serves as a subgradient of 9{x) at Xi. Because convexity is preserved by maxima, the formula 



6{x) 



max 

j 



+ £){x - xj) 



defines a convex function with value 9i at x — ccj. In concave regression the opposite constraint 
inequalities are imposed. Interpolation of predicted values in this model is accomplished by sim- 
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ply taking minima or maxima. Estimation reduces to a positive semidefinite quadratic program 
involving n(d + 1) variables and n(n — 1) inequality constraints. Note that the feasible region is 
nontrivial because setting all 9i = and all = works. In implementing the extension of the path 
algorithm mentioned in Section [SJ the large number of constraints may prove to be a hindrance and 
lead to very short path segments. To improve estimation of the subgradients, it might be worth 
adding a small multiple of the ridge penalty ^ to the objective function ([3]). This would have 
the beneficial effect of turning a semidefinite quadratic program into a positive definite quadratic 
program. 

8 Conclusions 

Our new path algorithm for convex quadratic programming under affine constraints generalizes 
previous path algorithms for lasso penalized regression and generalized lasso penalized regression. 
By directly attacking the primal problem, the new algorithm avoids the circuitous tactic of solving 
the dual problem and translating the solution back to the primal problem. Our various examples 
confirm the path algorithm's versatility. It's potential disadvantages involve computing the initial 
point — A~ 1 b and storing the tableau. In problems with large numbers of parameters, neither of 
these steps is trivial. However, if A has enough structure, then an explicit inverse may exist. As we 
noted, once A -1 is computed, there is no need to store the entire tableau. The multi-task regression 
problem with a large number of responses per case is a typical example where computation of A 
simplifies. In settings where the matrix A is singular, parameter constraints may compensate. We 
have briefly indicated how to conduct path following in this circumstance. 

Our path algorithm qualifies as a general convex quadratic program solver. Custom algo- 
rithms have been developed for many special cases of quadratic programming. For example, 
the pool-ad.jacent-viola tors algorithm (PAVA) is now the standard approach to isotone regression 



(jde Leeuw et al.. 20 09*1 . The other generic methods of quadratic programming include active set 
and interior point methods. A comparison with our path algorithm would be illuminating, but in 
the interests of brevity we refrain from tackling the issue here. The path algorithm bears a stronger 
resemblance to the active set method. Indeed, both operate by deleting and adding constraints to 
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a working active set. However, the active set method must start with a feasible point, and interior 
point methods must start with points in the relative interior of the feasible region. The path algo- 
rithm's ability to deliver the whole regularized path with little additional computation cost beyond 
constrained estimation is bound to be appealing to statisticians. 
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Figure 3: Toxin response example. Left: Solution path. Right: Coefficient paths for the constraints. 
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Figure 4: Left: Unconstrained and constrained estimates for the Iowa GPA data. Right: Solution 
paths of for the regression coefficients corresponding to high school rank. 




Figure 5: GPA prediction example. Left: Solution path for the predicted probabilities. Right: 
Residual sum of squares and effective degrees of freedom along the path. 



20 



